Make null results great again: A tutorial on equivalence testing

Dr James Bartlett and Dr Sarah Charles

Overview

  • Crash course in frequentist statistical inference

  • How do researchers often try and test the null?

  • Equivalence testing tutorial

  • Target data sets:

Crash course in frequentist statistics

  • Approach to statistical inference behind commonly used p-values

  • “Objective” theory where probabilities exist in the world; they are there to be discovered, independent of the observer

  • Probability cannot be assigned to individual events, only a collective

  • We can calculate the probability of data given a hypothesis

Null Hypothesis Significance Testing

  • This is where p-values come in: We can calculate the probability of observing data (or more extreme), assuming the null hypothesis is true

  • See it as a measure of surprise:

    • Low probability (small p-value) = data would be surprising under the null

    • High probability (large p-value) = data would not be surprising under the null

  • We can either reject or retain the null hypothesis; we cannot accept the null

Neyman-Pearson approach

  • The dominant, but often unnamed, approach to hypothesis testing (Lakens, 2021)

  • Suitable when the null hypothesis is plausible / meaningful

  • Creates a decision procedure on how to act while controlling error rates: reject the null hypothesis or not?

    • Type I errors / false positives controlled through alpha (e.g., \(\alpha\) = .05)

    • Type II errors / false negatives controlled through beta (e.g., \(\beta\) = .20)
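What "controlling alpha" buys you can be shown with a minimal R simulation (illustrative numbers only): when the null is exactly true, roughly 5% of tests come out significant at \(\alpha\) = .05 in the long run.

```r
set.seed(42)

# Simulate 1000 studies where the null is exactly true:
# both groups are drawn from the same normal distribution
p_values <- replicate(1000, t.test(rnorm(30), rnorm(30))$p.value)

# The long-run Type I error rate is controlled at alpha
mean(p_values < .05)  # close to .05
```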

Can we reject the null hypothesis?

Figure from Lakens et al. (2018)

Function of p-values

  • Important to keep in mind what p-values can and cannot do (Wasserstein & Lazar, 2016)

    • p-values can indicate how incompatible the data are with a specified statistical model

    • p-values do not tell you the probability your alternative hypothesis is true

    • p-values do not measure the size of an effect or the importance of a result

    • Scientific conclusions should not solely be based on whether a p-value passes a given alpha threshold or not

Statistically or practically significant?

  • Bem (2011) published an infamous series of studies purporting to show precognition (psychic abilities)

    • 100 participants (study 1) saw two hidden windows: one empty and one containing an erotic/non-erotic figure

    • Participants had to guess which window contained the figure, where 0% would be never correct, 50% a coin flip, and 100% correct every guess

  • What success rate (%) would convince you someone had psychic abilities?

Bem’s results

  • For non-erotic images, participants’ hit rate was not significantly higher than chance, t(99) = -0.15, p = 0.884

  • However, for erotic images, participants’ hit rate was significantly higher than chance, t(99) = 2.51, p = 0.014

  • So, maybe we do have evidence for precognition (at least for predicting the future position of erotic images…), but what about the effect size?

  • Hit rate for erotic images was 53.1% (3.14% above chance); 49.8% for non-erotic images (0.17% below chance)

  • Compare that to the success rate you said would convince you of psychic abilities: at least 88% and higher…

What does Bem help to teach us?

  1. The difference between a significant and non-significant result may not represent a meaningful shift (Interaction fallacy; Gelman & Stern, 2006)

  2. Even when a result is statistically significant, the effect size might be entirely meaningless (Meehl’s paradox; Kruschke & Liddell, 2018)

  3. It is important to keep in mind whether the null hypothesis is plausible / meaningful for your study (Crud factor; Orben & Lakens, 2020)

What if you want to support the null?

  • With these lessons in mind, there are scenarios where supporting the null is a desirable inference:

    • Is there no meaningful difference between two competing interventions?

    • Does your theory rule out specific effects?

    • Is your correlation too small to be meaningful?

  • However, researchers often mistakenly conclude there is no effect from a non-significant p-value alone (Aczel et al., 2018; Edelsbrunner & Thurn, 2020)

Our project

  • Inferences in Psychology Teaching and Learning: A Review of Statistics Misconceptions

  • Unfortunately, little progress so far in reviewing the 76 articles…

  • Our RQ: Can studies in psychology teaching and learning meet their inferential goals?

    • What is the prevalence of misconceptions in interpreting non-significant results in psychology teaching and learning?

    • How do studies in psychology teaching and learning justify their sample sizes?

Equivalence testing

  • No statistical approach can directly support the null hypothesis of exactly 0

  • Equivalence testing is one such approach, originating in drug development research

  • Equivalence testing flips NHST logic and uses two one-sided t-tests to test your effect against two boundaries:

    • Is your effect significantly larger than a lower bound?

    • Is your effect significantly smaller than an upper bound?

Equivalence testing logic

Figure from Lakens et al. (2018)

Equivalence testing decisions

Figure from Lakens (2017)

Decisions to make

Alpha

  • Default of .05 (which you can change); since two one-sided tests are performed, this corresponds to a 90% confidence interval

Equivalence bounds

  • Your smallest effect size of interest, expressed in raw or standardised units

Sample size

  • Power analysis based on alpha, desired power, and equivalence bounds
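As a sketch of how these three ingredients combine, the required sample size can be approximated in base R with a normal approximation, assuming a true effect of zero and symmetric standardised bounds (TOSTER also provides dedicated power functions, e.g. power_t_TOST() in recent versions, which give exact t-based answers):

```r
# Required n per group for a two-sample TOST, assuming the true effect is 0,
# with bounds of d = +/-0.5, alpha = .05, and 90% desired power
alpha   <- .05
power   <- .90
d_bound <- 0.5

n_per_group <- 2 * ((qnorm(1 - alpha) + qnorm(1 - (1 - power) / 2)) / d_bound)^2
ceiling(n_per_group)  # 87
```

Narrower bounds shrink the denominator, so the required sample size grows quickly as your smallest effect size of interest gets smaller.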

TOSTER R package

  • Flexible R package (Lakens & Caldwell) that can apply equivalence or interval testing to focal tests:

    • T-tests

    • Correlations

    • Meta-analysis

    • Non-parametric tests

install.packages("TOSTER")

Worked example

  • Technology or Tradition? A Comparison of Students’ Statistical Reasoning After Being Taught With R Programming Versus Hand Calculations (Ditta & Woodward, 2022)

  • Compared conceptual understanding of statistics at the end of a 10-week intro course

  • Students completed one of two versions:

    1. Formula-based approach to statistical tests (n = 57)

    2. R code approach to statistical tests (n = 60)

  • Research question (RQ): Does learning through hand calculations or R code lead to greater conceptual understanding of statistics?

  • Between-subjects IV: Formula-based or R code approach course

  • DV: Final exam (conceptual understanding questions) score as proportion correct (%)

What are we working with?

Their main results

  • Their first approach to the analysis was a simple independent samples t-test:
t.test(e3total ~ condition, 
       data = Ditta_data)

    Welch Two Sample t-test

data:  e3total by condition
t = -1.117, df = 110.97, p-value = 0.2664
alternative hypothesis: true difference in means between group HC and group R is not equal to 0
95 percent confidence interval:
 -7.584355  2.116173
sample estimates:
mean in group HC  mean in group R 
        69.29091         72.02500 

Equivalence test for two groups

  • The traditional t-test was non-significant, but was there no meaningful difference?

  • We can apply an equivalence test using bounds of ±10% for our smallest effect size of interest

# Summary statistics (m1, sd1, n1, m2, sd2, n2) taken from Ditta & Woodward (2022)
TOST_10 <- tsum_TOST(m1 = m1, # Group 1: Hand calculations
                     sd1 = sd1,
                     n1 = n1,
                     m2 = m2, # Group 2: R
                     sd2 = sd2,
                     n2 = n2,
                     low_eqbound = -10, # User-defined equivalence bounds
                     high_eqbound = 10,
                     alpha = .05)

  • Using bounds of ±10%, we can conclude the effect is statistically equivalent and not significantly different to 0:

Welch Modified Two-Sample t-Test
Hypothesis Tested: Equivalence
Equivalence Bounds (raw):-10.000 & 10.000
Alpha Level:0.05
The equivalence test was significant, t(110.97) = 2.968, p = 1.83e-03
The null hypothesis test was non-significant, t(110.97) = -1.117, p = 2.66e-01
NHST: don't reject null significance hypothesis that the effect is equal to zero 
TOST: reject null equivalence hypothesis

TOST Results 
                   t       SE     df      p.value
t-test     -1.117011 2.447684 110.97 2.664022e-01
TOST Lower  2.968483 2.447684 110.97 1.833635e-03
TOST Upper -5.202506 2.447684 110.97 4.542552e-07

Effect Sizes 
                estimate       SE   lower.ci  upper.ci conf.level
Raw           -2.7340909 2.447684 -6.7940673 1.3258855        0.9
Hedges' g(av) -0.2073357 0.188892 -0.5208061 0.1001411        0.9

Note: SMD confidence intervals are an approximation. See vignette("SMD_calcs").

  • We can also get a plot showing the equivalence test for both raw and standardised units:
# Plot using the equivalence test object
plot(TOST_10)

Setting equivalence bounds

Theory / subject knowledge

  • Maybe our intervention would need to improve performance by at least a grade band (10%)?

Small telescopes approach

  • Often used in replication studies: The effect size the original study would have 33% power to detect

Effect size benchmarks

  • In the absence of other information, what effect size distributions are relevant to your topic?

Small telescopes

  • Ditta and Woodward had 33% power to detect effects of d = ±0.28; using those as bounds, the effect would not be statistically equivalent and we would need more data
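That value can be approximated in base R with power.t.test(), assuming roughly 58 per group (their groups were n = 57 and n = 60); dedicated packages give slightly different exact values:

```r
# Smallest standardised effect a two-group design with ~58 per group
# has 33% power to detect (sd = 1, so delta is on the Cohen's d scale)
power.t.test(n = 58, power = 1/3, sig.level = .05,
             type = "two.sample")$delta  # roughly 0.28
```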

Effect size benchmarks

  • Mean effect size in pre-registered between-subjects studies was d = 0.35 (Schäfer & Schwarz, 2019), which would not be equivalent

Summary

  • Null hypothesis significance testing and p-values are suited to specific roles

  • If supporting the null is a desirable inference, you need techniques like equivalence testing

  • This allows you to conclude whether effects are statistically equivalent or not

  • Setting equivalence bounds is the hardest decision, and one you must justify transparently

Where to go next

Thank you for listening!

Any questions?

Dr James Bartlett

  • @JamesEBartlett

  • james.bartlett@glasgow.ac.uk

Dr Sarah Charles

  • @SarahCharlesNC

  • sarah.charles@ntu.ac.uk